Correcting a POS-Tagged Corpus Using Three Complementary Methods

Author

  • Hrafn Loftsson
Abstract

The quality of the part-of-speech (PoS) annotation in a corpus is crucial for the development of PoS taggers. In this paper, we experiment with three complementary methods for automatically detecting errors in the PoS annotation of the Icelandic Frequency Dictionary corpus. The first two methods are language independent, and we argue that the third can be adapted to other morphologically complex languages. Once possible errors have been detected, we examine each error candidate and hand-correct the corresponding PoS tag if necessary. Overall, based on the three methods, we hand-correct the PoS tagging of 1,334 tokens (0.23% of the tokens) in the corpus. Furthermore, we re-evaluate existing state-of-the-art PoS taggers on Icelandic text using the corrected corpus.
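
The abstract does not say what the three detection methods are, so the following is only an illustrative sketch of the general "detect candidates, then review by hand" workflow it describes. It assumes one simple, language-independent heuristic: flag a token when several automatic taggers agree on a tag that differs from the corpus tag. The function name `flag_error_candidates`, the `min_agreement` parameter, and the toy tags are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def flag_error_candidates(
    gold_tags: Sequence[str],
    tagger_outputs: Sequence[Sequence[str]],
    min_agreement: int,
) -> List[Tuple[int, str, str]]:
    """Return (index, gold_tag, voted_tag) for every token where at least
    `min_agreement` taggers agree on a tag that differs from the gold tag.

    This is an assumed heuristic for illustration, not the paper's method.
    """
    if not tagger_outputs:
        raise ValueError("at least one tagger output is required")

    candidates = []
    for i, gold in enumerate(gold_tags):
        # Count how many taggers proposed each tag for token i.
        votes = Counter(output[i] for output in tagger_outputs)
        voted_tag, count = votes.most_common(1)[0]
        if count >= min_agreement and voted_tag != gold:
            candidates.append((i, gold, voted_tag))
    return candidates


# Toy data: hypothetical tags, not drawn from the corpus.
gold = ["DET", "NOUN", "NOUN", "ADV"]
outputs = [
    ["DET", "NOUN", "VERB", "ADV"],
    ["DET", "NOUN", "VERB", "ADJ"],
    ["DET", "NOUN", "VERB", "ADV"],
]
candidates = flag_error_candidates(gold, outputs, min_agreement=3)
print(candidates)
```

A flagged token is only a candidate: as in the workflow described in the abstract, each one would still be inspected by a person, and the corpus tag changed only if it is actually wrong.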


Similar resources

TALC-sef: A Manually-Revised POS-Tagged Literary Corpus in Serbian, English and French

In this paper, we present a parallel literary corpus for Serbian, English and French, the TALC-sef corpus. The corpus includes a manually-revised pos-tagged reference Serbian corpus of over 150,000 words. The initial objective was to devise a reference parallel corpus in the three languages, both for literary and linguistic studies. The French and English sub-corpora had been pos-tagged from th...


Morphology Driven Manipuri POS Tagger

A good POS tagger is a critical component of a machine translation system and of other related NLP applications, where an appropriate POS tag must be assigned to each word in a collection of texts. Not enough POS-tagged corpus data is available for the Manipuri language, which rules out machine learning approaches to building a POS tagger for it. A morphology driven Manipuri POS tagger that uses thre...


Détection et correction automatique d'erreurs d'annotation morpho-syntaxique du French TreeBank (Detecting and Correcting POS Annotation in the French TreeBank) [in French]

The quality of the Part-Of-Speech (POS) annotation in a corpus has a large impact on training and evaluating POS taggers. In this paper, we present a series of experiments that we have conducted on automatically detecting and correcting annotation errors in the French TreeBank. Two methods are used. The first simply relies on identi...


TCOF-POS : un corpus libre de français parlé annoté en morphosyntaxe (TCOF-POS : A Freely Available POS-Tagged Corpus of Spoken French) [in French]

This article details the creation of TCOF-POS, the first freely available corpus of spontaneous spoken French. We present here the methodology that was followed in order to obtain the best possible quality in the final resource. This corpus already is freely available and can be used as a training/validation corpus for NLP tools, ...


Web-Based Bengali News Corpus for Lexicon Development and POS Tagging

Lexicon development and Part of Speech (POS) tagging are very important for almost all Natural Language Processing (NLP) applications. The rapid development of these resources and tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. We have used a Bengali news corpus, developed from the web archive of a widely read Bengali newspaper. The ...




Publication date: 2009